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Software Scalability Issues in Large Clusters 



m 
o 
o 

(N 
>v 

S 

in 

S 1 

Oh- 

S: 
o ■ 
o ; 

o ■ 

• i-H . 

C/3 , 
>V 
43 ' 

<n : 
> ■ 

in ■ 

o : 
o . 

in ■ 
o ■ 
m ' 
O . 

o : 

• i— i 

43 : 

Ph. 



X 



A. Chan, R. Hogue, C. Hollowell, 0. Rind, T. Throwe, T. Wlodek 
Brookhaven National Laboratory, NY 11973, USA 

The rapid development of large clusters built with commodity hardware has highlighted scalability 
issues with deploying and effectively running system software in large clusters. We describe here our 
experiences with monitoring, image distribution, batch and other system administration software 
tools within the 1000+ node Linux cluster currently in production at the RHIC Computing Facility. 



I. BACKGROUND 

The rapid development of large clusters built with 
affordable commodity hardware has highlighted the 
need for scalable cluster administration software that 
will help system administrators to deploy, maintain 
and run large clusters. Scalable cluster administra- 
tion software is critical for the efficient operation of 
the 2000+ CPU Linux Farm at the RHIC Computing 
Facility (RCF), and this paper describes our experi- 
ence with cluster administration software currently in 
use at the RCF. 

The RCF is a large scale data processing facility at 
Brookhaven National Laboratory (BNL) for the Rel- 
ativistic Heavy Ion Collider (RHIC), a collider ded- 
icated to high-energy nuclear physics experiments. 
The RCF provides for the computational needs of the 
RHIC experiments (BRAHMS, PHENIX, PHOBOS, 
PP2PP and STAR), including batch, mail, printing 
and data storage. In addition, BNL is the U.S. Tier 
1 Center for ATLAS computing, and the RCF also 
provides for the computational needs of the U.S. col- 
laborators in ATLAS. 

The Linux Farm at the RCF provides the majority 
of the CPU power in the RCF. It is currently listed as 
the 3 rd largest cluster, according to "Clusters Top500" 
(http://clusters.top500.org). Figure 1 shows the rapid 
growth of the Linux Farm in the last few years. 

All aspects of its development (hardware and soft- 
ware), operations and maintenance are overseen by 
the Linux Farm group, currently a staff of 5 FTE 
within the RCF. 



II. HARDWARE 



TABLE I: Linux Farm hardware 



Brand 


CPU 


RAM 


Storage 


Quantity 


VA Linux 


450 MHz 


0.5-1 GB 


9-120 GB 


154 


VA Linux 


700 MHz 


0.5 GB 


9-36 GB 


48 


VA Linux 


800 MHz 


0.5-1 GB 


18-480 GB 


168 


IBM 


1.0 GHz 


0.5-1 GB 


18-144 GB 


315 


IBM 


1.4 GHz 


1 GB 


36-144 GB 


160 


IBM 


2.4 GHz 


1 GB 


240 GB 


252 



the hardware failures by category is shown in Figure 
2. 



III. MONITORING SOFTWARE 

Monitoring and control of the cluster hardware, 
software and infrastructure (power and cooling) is 
provided via a mix of open-source software, RCF- 
designed software and vendor-provided software. Var- 
ious components of the monitoring software have been 
redesigned for scalability purposes and to add various 
persistent and fault-tolerant features to provide near 
real-time information reliably. Figure 3 shows the two 
monitoring models used in the design of the various 
components of the monitoring software. 

Figure 4 shows the auto-updating Web-interface to 
the RCF-designed monitoring software, which allows 
us to view individual server status within the cluster. 
Figure 5 is the Web-interface to the complementary 
open-source ganglia [1] monitoring software, which 
provides monitoring information in a user-friendly for- 
mat via a Web interface. 



The Linux Farm is built with commercially avail- 
able thin rack-mounted, Intel-based servers (1-U and 
2-U form factors). Currently, there are 1097 dual- 
CPU production servers with approximately 917,728 
Speclnt2000. Table 1 summarizes the hardware cur- 
rently in service in the Linux Farm. Hardware reliabil- 
ity has not been an issue at the RCF. The average fail- 
ure rate is 0.0052 failures /(machine ■ month), which 
translates to 5.7 hardware failures per month at its 
present size. Hardware failures are dominated by disk 
and power supply failures. A detailed breakdown of 



IV. IMAGE DISTRIBUTION SOFTWARE 

The RCF used a Linux image distribution system 
that relied on an image server that made the image 
available via NFS to the Linux Farm servers. The im- 
age was downloaded and deployed during the rebuild- 
ing process. That system worked well until the Linux 
Farm grew beyond ~ 150 servers, at which point NFS 
limitations prevented this system from reliably up- 
grading a large number of servers simultaneously in 
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FIG. 1: The growth of the Linux Farm at the RCF. 
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FIG. 2: Breakdown of hardware failure by category. 
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FIG. 4: System status of Linux Farm nodes. 
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FIG. 5: Detailed view of a STAR node with ganglia. 
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FIG. 6: MySQL usage in the RCF Linux Farm. 
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an acceptable time interval. 

In 2001, the RCF switched to RedHat's Kickstart 
installer [2] . Kickstart allowed us to switch to a Web- 
based image server using standard rpm packages. It 
has proven very scalable (20 minutes/server with 100's 
of servers rebuilt simultaneously) and highly flexible. 
Multiple images can co-exist with different install op- 
tions. 

Both the old and the new Kickstart image distribu- 
tion systems rely on a secure MySQL [3] database sys- 
tem for server authentication and configuration spec- 
ification. 



V. DATABASE SYSTEMS 



administration purposes such as installing or updat- 
ing software. The scripts are also used for automatic 
emergency remote power access to the servers during 
infrastrucure system failures (UPS or cooling). Be- 
cause of the possibility of catastrophic disasters in the 
case of infrastructure failures (for example, an electri- 
cal fire), it is important that the scripts perform fast, 
parallel and controlled shutdown of the Linux Farm 
servers. Figure 9 shows how our PYTHON-based 
scripts fork multiple processes to become a scalable 
system administration tool. 

The RCF also uses vendor-provided software for 
cluster management. Figure 8 is an example of the 
user-interface to a software system that is currently 
deployed in the RCF Linux Farm. 



The RCF uses the open-source, lightweight MySQL 
database software throughout the facility. MySQL en- 
joys wide support in the open-source community, and 
it is well documented. It is used as the general back- 
end data repository for monitoring, batch control and 
cluster management purposes at the RCF. MySQL is 
a scalable and easily configurable database software. 
Figure 6 shows how MySQL databases are configured 
in the Linux Farm to achieve high scalability and relia- 
bility for monitoring and batch control purposes. Fig- 
ure 7 shows the PERL-TK interface to the MySQL 
database for our batch control and monitoring soft- 
ware. 



■ f; 












T-VIV 




Ji4i Iji.I lAiiiiici JUlfli 






nrsUIU.KT.BM 


sw -svbl l'±l>: 




L J 








nrsUin.KT.BM 


IW -]Vbl Ijalc 




L J 








nrsMlo.KT.BM 


iw -svsl l-±l>: 




: :i 








ru-nOTI- . n:f .In 1 


Hiv :< *-,i UiIh 




r s 






J 


nr>«l1!i.n:f .In 1 


„,, >dlj,l. 




< s 








ria"KHnK, n:l-. In 1 


niv aKiilUilH 




* x 








prnrlinV r-rt hrl 


F.vj .iiul \.*\lt> 




:i 








nrsuI12.rcr.BM 


l«/ -svsl l-±l>: 




L 'J 








nrsUllS.KT.Brl 


iw -ivsl T±l>: 




L 'J 






/ 


m-rf<lH).iT:f.bi 1 


iw mi Mi la 




! 5 




Hi." 


mil | UcWJi 


ai | h | 


in Ma'Kuiu-^WjIJ itAiP. iJW JJiV 




-liipfe ll> 


KlitllW 


Miih 


.kfc 1 IH 




X 








n[Hi,JKH JlP.i Luln-J in JM[t 






jiiii_.imi mi "fl. n 


i" mi 


iiinStlli 


nriinjJisiiji!jiijii^ii:=j!"nisi 






jnii_i™ iwJia.n 


«»»i;ii- 


rnW 


nriHi^3isiiji!jiUiipKii:=j!"nisi 






JnKJIWI Itl IKiJI 


Chilli rr. 


rnxinu 


I'l lli*_^IIIIJIS_j?t_p^M \r.zj inhlll 






Jat.JITGllHlLrf.LI 


^sltlrcii ta run 


rirsuu.2 


Htuinjuii.iii.stjinvsksjaiauj 




_i 


JoBJllKillBlffl.U 


Halting ta run 


rsrsUUS 


Htuinjuii.iii.stjinrsksjaiauj 






juiijmaraja.D 


»J i U IEj L'J ITJII 


riniiilS 


PCHnJKHJP.iLuliriii.iJMK) 






jiiii_i™iiin.H_n 






nmi«jisnji!_xi_iii0Kii:=jS"nisi 






jiildim ini m _n 


>" mi 


iiinSBS 


nriHi^3isiiji!jiUiipKii:=j!"nisi 




/ 


jiiiLiiiffiiinyiui 


Willi!! Ill lllll 




I'l lliBJilllljn jc 1 JH0« n:= J juiilll 


llHlm:* | HhUJ 


x | *i. | | 


■^ixpHllI | IIHIIIIM | IXlL | | 



FIG. 7: Batch job control with MySQL back-end. 



A. OTHER SYSTEM ADMINISTRATION 
TOOLS 

The Linux Farm also uses RCF-designed PYTHON- 
based scripts for fast, parallel access to the Linux 
Farm servers. The scripts are used for routine system 




rcrtOJ fUJ - MC-i 



^uotAnr ? . : .hi l.il 
K-yJf." :^.-tHii 

— - 1 1 =h ±a -i : : w rit 
I-iim-h - : : ".h.i"i:;i 
KiyKZZ -i. Kn.il 
r i ^rl*.:| .F.*r| -v * ■ :rKr| ft 1 1| \t :"ii:"K 

|'K rAjiLh..>}|Trj-:.j-iiLj{r[T : ij> i: t . l:^ | »khi t.** !L -*h ^ rs'-ji- ■> 

^ * : k t -rj ■ i ■jifyrr ■ -fr'f iff jM h.tJf 

' i- — i'l jii' fc " il ■ ivi * -:r 'ii I Lr mi ■ ■ | ri L r I' 



;^i.;ltJU .^..|' \ yt. t'\ r n't l i>j >/i.jil.m. r j 

lvi.jh wu"-. k.v.1' .tvi. j.im - u).ir.nj rj. ^^ 

L^,.;^[M.FJ..l■ .>+■>: JIliH) .'.Hi! Sj..;i[.m. 

;u:i[iu.ij..ii . y>:-i?»-: ««>.ij:m» h:t:-"- <<i 

Jm.: i En. Hi. .■■ :>K ni>i UIM U'A ij.n- 1^ 

i P"*i F- ■ l :-,-^..Ci;.i ■ YCTl ■H^^L■VI'■ '..V li'-'i I.i 

■\.-iF/i-if- ■ l ..^.-i-i'i.' '^OiiiVlli ro>x r ■ ^■^ 

■\--iF/nF- ■ l .-LVI'1.' 'HliLiVn I.^T' ■ t: 

ih#f - hi"K i-ir-: ^n^h'Lr >n r:-- ! ; ■ n 

ih#f - hi"K ■■.■■■i'l I -t:- i -■■ i: i - ./i"' - ii.*r \.\ 

i |.jvr" h" -"il "v* - ! i. ?a\ *.i i**i'tr ; m I ruT' ^-r - *K 

:ii::ilki:h::i- ■ ■■ ■ yi. i. cjj.l "'" ■>" tn-CL r: 

a 1 — T 



FIG. 8: Vendor provided remote power management soft- 
ware. 



VI. CONCLUSION 

Scalable system software has become an important 
factor to the RCF for efficiently deploying and man- 
aging our rapidly growing Linux cluster. It allows us 
to monitor the status of individual cluster servers in 
near-real time, to deploy our Linux image in a fast 
and reliable fashion across the cluster and to access 
the cluster in a fast, parallel manner. 

Because not all of our system software needs can be 
addressed from a single source, it has become neces- 
sary for us to use a mix of RCF-designed, open-source 
and vendor-provided software to achieve our goal of a 
scalable system software architecture. 
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FIG. 9: Example of scalable cluster management tool. 
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